Heart failure (HF) is a chronic cardiac disease that has progressively severe implications on the patient’s quality of life, and it has been a leading cause of death and a significant contributor to the global burden of disease.1 The pathophysiology of HF is complex and involves multiple factors, but a major hallmark of HF is the progressive degeneration of the myocardium that is accompanied by cardiac fibrosis.2 34 During the progression of HF, a process known as cardiac remodelling progressively replaces the myocardium with fibrous tissues, which in turn leads to the gradual stiffening of the heart ventricular wall. Thus, the identification of profibrotic pathways are essential to the development of therapeutic strategies to delay or even to prevent the profibrotic remodelling of the heart.
Previous studies have shown that the straining condition has played a significant role in potently altering the Hippo pathways.5 It was found that mechanical stress on the cell cytoskeleton induces the nuclear entry of the YAP/TAZ complex, which subsequently activates the proliferative pathways.6 7 Although such a mechanism may seem to be beneficial for the restoration of the damaged cardiomyocytes, this pathway has been shown to be more active in cardiac stromal cells, which can subsequently differentiate into myofibroblasts that eventually leads to the profibrotic remodelling of the heart.2 Therefore, directly suppressing the YAP/TAZ complex in cardiac stromal cells may be an effective strategy. Recently, a study by Garoffolo et al. investigated the use of a small molecular inhibitor, verteporfin (VTP), to suppress the YAP/TAZ complex with or without the presence of TGF-β1, a profibrotic cytokine under the maximal strain condition.
The study by Garoffolo et al. aimed to examine the antifibrotic effect of VTP in cardiac stromal cells under profibrotic signals that simulates the straining conditions under heart failure as part of their larger objective.2
Here, we focus on this particular aspect of the research discussed above. In particular, we aim to use RNA-seq data to identify the changes in the transcriptional profile of cardiac stromal cells after the use of VTP. Figure 1. shows the experimental design of the study. Here, we perform differential gene expression analysis comparing the transcriptome of the cells treated with VTP with control under the known profibrotic straining condition. Additionally, we test the efficacy of the drug (VTP) under the presence of TGF-β1, a potent profibrotic cytokine.8 To investigate whether direct inhibition of the YAP/TAZ complex with VTP, we perform threshold over-representation analysis to identify enriched pathways in the differentially expressed genes, which would allow us to identify whether VTP can effectively reverse a profibrotic phenotype.
Figure 1. Schematics of RNAseq experiment design. Samples were cultured under maximal mechanical strains. Six independent cell cultures were used for each treatment condition. cSt-Cs: cardiospheres-derived primitive cardiac stromal cells; VTP: verteporfin; TGF-β1: transforming growth factor beta-1. Figure adapted from Assignment 1 and created with BioRender.com.
The data used in the present analysis is retrieved from the GEO database under the accession number GSE203358. The dataset contains RNA-seq data from 24 samples, which are evenly split into 4 groups: - Control (maximal strain condition) - Verteporfin (VTP) treatment (maximal strain condition) - TGF-β1 treatment (maximal strain condition) - Verteporfin (VTP) treatment with TGF-β1 (maximal strain condition)
Since we have previously performed initial data cleaning and normalization, here we provide a brief summary of the previous analysis. The entire process can be found here.
The original dataset is composed of 24 samples and
60583genes. After removing genes with low counts, the
dataset we use in the present analysis is composed of 14814
genes. The reference genome used is the human GRCh38
genome.
The original dataset contains a mix of HUGO gene symbols and GenBank identifiers. We used the biomaRt package to map the rest of the identifiers to HUGO gene symbols.9 The dataset is normalized using the Trimmed Mean of M-values (TMM) method implemented by the edgeR package.10 After applying normalization, the library size for each sample has been effectively adjusted such that the distribution of gene expressions are comparable across samples.
We evaluated the similarity between the samples by accessing the
clustering of the samples (as well as replicates within each group)
using multidimensional scaling (MDS). The MDS plot shown in Figure 2. reveals a moderate clustering of samples
based on the treatment conditions, and in particular, the control
samples under mechanical strain alone are clustered separately with any
other groups. The other treatment groups show less separation along the
first dimension, but clustering is still observed. Interestingly,
although most of the samples are clustered based on the treatment
conditions and that we did not observe any clustering based on the cell
culture replicates, we notice that certain samples that do not cluster
tightly with the rest in the same treatment group seem to be consistent
across the conditions. For example, comparing to other replicates, in
cells from culture M89 and M99, different
conditions seem to have a smaller effect on the gene expression
profile.
Figure 2. Multidimensional scaling plot of cSt-Cs gene expression across samples. All samples are coloured by treatment condition. The first two dimensions are shown. Moderate clustering by treatment and low clustering by replicate for a small number of cultures is observed. Figure adapted from Assignment 1.
The current R Notebook assumes that it is being run on the BCB420-2023 base image. Additional packages required are noted below.
# edgeR package for manupulating and analyzing RNAseq data
if (!requireNamespace("edgeR", quietly = TRUE)) {
BiocManager::install("edgeR")
}
# limma package for differential expression analysis
if (!requireNamespace("limma", quietly = TRUE)) {
BiocManager::install("limma")
}
# ComplexHeatmap package for creating heatmaps
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
BiocManager::install("ComplexHeatmap")
}
# circlize package for circular visualization
if (!requireNamespace("circlize", quietly = TRUE)) {
BiocManager::install("circlize")
}
# ggplot2 package for creating plots
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
# KableExtra package for creating tables and formatting
if (!requireNamespace("kableExtra", quietly = TRUE)) {
install.packages("kableExtra")
}
# dplyr package for data manipulation
if (!requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
# DT package for creating interactive tables
if (!requireNamespace("DT", quietly = TRUE)) {
install.packages("DT")
}
# Load the dplyr package
library("dplyr")
# Load the ggplot2 package
library("ggplot2")
# We will consistently use package::function() to avoid confusion However, the
# dplyr package has to be loaded first to allow definitions of various
# operators such as %>%In the previous analysis, we have saved the processed data output to a file. Hence, we can simply load the data from the file and perform the analysis.
# Load the processed data
counts <- read.csv(file = "./Data/expr_norm_counts.csv", header = TRUE, row.names = 1)
# View the first few rows of the data
head(counts)Table 1. Processed count data used for the present analysis. The data is composed of 14814 genes and 24 samples, where each gene is identified by a unique HUGO symbol. The first few rows are shown.
Since the header of the data contains the sample information, we will extract the sample information from the header to reconstruct the experimental design.
# Extract the sample information from the header
sample_matrix <- colnames(counts) %>%
lapply(function(x) {
unlist(strsplit(x, split = "\\."))
}) %>%
as.data.frame()
# Rename the data frame
colnames(sample_matrix) <- colnames(counts)
rownames(sample_matrix) <- c("culture", "treatment")
# Transpose the data frame so that each row represents a unique sample and each
# column represents a unique sample attribute
sample_matrix <- data.frame(t(sample_matrix))
# View the sample information
DT::datatable(sample_matrix[order(sample_matrix$treatment), ], rownames = TRUE, filter = "top",
options = list(pageLength = 6)) %>%
# Changing the font size of content
DT::formatStyle(names(sample_matrix), fontSize = "12px") %>%
# Changing the font size of row names (sample names)
DT::formatStyle(0, fontSize = "12px")Table 2. Sample information used for the present analysis. A total of 24 samples are included in the analysis, where each sample is identified by a unique culture ID and treatment condition. Cells are retrieved from 6 independent cultures, and each is used for 4 different treatment conditions.
Understanding the antifibrotic effect of VTP requires a comprehensive analysis of the gene expression profile of the cSt-Cs by identifying the transcriptional changes with or without the treatment of VTP. To this end, we will perform two separate differential gene expression analysis with or without the presence of TGF-β1, in which we can isolate the effect of VTP as the only independent variable.
However, although our data is composed of a total of 4 treatment condition, which allows for the comparison between control (strain alone) and TGF-β1, we will not perform such analysis. It has been well documented that TGF-β1 is a potent profibrotic cytokine that will induce a series of transcriptional activation on profibrotic pathways.11 The authors of the original publication has also verified the profibrotic effect of TGF-β1.8 We have also shown that TGF-β1 significantly upregaltes the expression of profibrotic genes (see associated journal entry for details). Therefore, based on our objective, we believe that it is more valuable to investigate the effect of VTP under maximal strain, both with and without the presence of TGF-β1 separately.
Before we perform the differential gene expression analysis, we first want to observe the global expression change at the gene level. Here, we plot a heatmap for all the genes in the dataset to visualize the expression change across all samples under various conditions.
# Creating a DGEList object
counts_dge <- edgeR::DGEList(counts = counts, group = sample_matrix$treatment)
# As discussed in the associated journal entry, the data we use are the counts,
# rather than the counts per million (CPM) values. Therefore, we will need to
# convert the count to CPM values for normalization on library size.
counts_dge <- edgeR::calcNormFactors(counts_dge)
counts_cpm_all <- edgeR::cpm(counts_dge)# Create a heatmap matrix for all the genes by scaling the CPM values This
# allows the heatmap colour to be on a consistent and continuous scale across
# all the samples
heatmap_matrix <- counts_cpm_all %>%
t() %>%
scale()
# Convert it back to the original format
heatmap_matrix <- t(heatmap_matrix)# Checking the range of the heatmap matrix
hmap_range <- range(heatmap_matrix)
hmap_range## [1] -3.54320 4.65571
The range of the heatmap matrix is between -3.5432004 and 4.6557105.
Since the scaled expression value ranges from negative to positive, we
will define a three colour palette for the heatmap, where we will use
blue to represent the downregulated genes and
red for upregulated genes. In the middle of the scale is
white.
Here, heatmap creation and visual representation are implemented
using the ComplexHeatmap and circlize
packages, respectively.12 13
# Set the 3 color palette based on the scaled heatmap matrix
hmap_col <- circlize::colorRamp2(c(hmap_range[1], 0, hmap_range[2]), c("blue", "white",
"red"))Due to the relatively large number of samples, we will also apply annotations to the heatmap to better visualize the clustering of the samples. In particular, we will annotate the heatmap by the treatment condition and culture replicate. This would better allows us to identify whether any of the variables have a significant effect on the clustering of the samples.
# Sort the sample matrix in the same order as the heatmap matrix
sample_matrix_sorted <- sample_matrix[colnames(heatmap_matrix), ]
# Treatment Conditions
treatments <- unique(sample_matrix_sorted$treatment)
# Define 4 colours for the 4 treatment conditions
treatment_col <- hcl.colors(4, palette = "viridis")
names(treatment_col) <- treatments
# Culture Replicates
cultures <- unique(sample_matrix_sorted$culture)
# Define 6 colours for the 6 culture replicates
culture_col <- hcl.colors(6, palette = "set3")
names(culture_col) <- cultures
# Create an annotation dataframe
anno_df <- data.frame(treatment = sample_matrix_sorted$treatment, culture = sample_matrix_sorted$culture)
anno_col <- list(treatment = treatment_col, culture = culture_col)
# Create the heatmap annotation
hmap_anno <- ComplexHeatmap::HeatmapAnnotation(df = anno_df, col = anno_col)To better visualize the heatmap, we cluster the genes into 4 groups using the k-means clustering algorithm. This allows the genes that have similar expression profiles to be grouped together.14 12 Additionally, since our dataset consists of 4 treatment conditions, it is expected that a clustering of 4 groups will reveal some interesting patterns in the expression profiles.
# Generate the heatmap Set the seed for reproducibility due to he inherent
# randomness of the k-means clustering algorithm
set.seed(91711)
hmap_all <- ComplexHeatmap::Heatmap(heatmap_matrix, show_row_dend = TRUE, show_column_dend = TRUE,
show_column_names = TRUE, show_row_names = FALSE, col = hmap_col, top_annotation = hmap_anno,
km = 4)
set.seed(NULL)
# Plot the heatmap
hmap_allFigure 3. Heatmap of global gene expression across
all samples of different treatments. A k-mean clustering of 4 groups was
applied to the heatmap for clusterization of gene with similar
expression profiles. Samples are hierarchically clustered by their
expression profiles. The top bar annotate the samples by treatment
conditions and culture in which the cells were extracted.
The 4 groups of genes generated by the k-means clustering algorithm show comparatively distinct expression profiles that define 4 subset (cluster) of samples. However, each of the 4 groups of samples (as indicated by the dendrogram) consists of a mix of samples from different treatment conditions. Based on the observation, we would not expect a very high number of differentially expressed gene between the conditions. However, despite the mix of treatment conditions within each cluster, it is still observable that samples under certain conditions tend to cluster together. For example, the left most cluster consists of samples that are treated with VTP. We can also observe a clustering of a subset of control cells and those that are treated with TGF-β1 alone. Since these two conditions are expected to show a profibrotic transcriptional programme, it is expected that these cells share similar expression patterns.
In addition, the annotation based on the culture replicate also shows a moderate clustering of the samples taken from the same cSt-Cs culture. This means that the culture in which the cells were extract likely contributes to the variance in the gene expression. The observation here using the heatmap is consistent with the distribution of the samples from the MDS plot in Figure 2. Therefore, we will take this into account when we perform build the model for differential expression analysis.
For differential gene expression analysis, we use the
edgeR package, which implements a generailized linear model
(GLM) to test for differential expression.10 Based on the objective of our
analysis, we will build the model based on the treatment conditions.
Additionally, since we have observed a subset of cultures replicates
that shows a lower variance within cells from the same culture (Figure 2.) and a moderate degree of clustering in the
heatmap (Figure 3.), it is expected that the culture
replicate will also contribute to the variance in gene expression.
Therefore, we choose the model to include both factors, treatment and
culture replicate.
# Extract the factors for the model
treatments <- sample_matrix$treatment
cultures <- sample_matrix$culture
# Create the model
model_treat_cul <- model.matrix(~0 + treatments + cultures)
rownames(model_treat_cul) <- rownames(sample_matrix)
# View the model matrix
DT::datatable(model_treat_cul[order(gsub("M[0-9]{2}\\.", "", rownames(model_treat_cul))),
], rownames = TRUE, options = list(pageLength = 6, scrollX = TRUE)) %>%
# Changing the font size of content
DT::formatStyle(names(model_treat_cul), fontSize = "12px")Table 3. Model design for differential expression analysis. The model includes both treatment and culture replicate as factors for fitting the generalized linear model.
Furthermore, edgeR implements dispersion estimation with
the negative binomial model, which estimates the global variation across
the samples in a biological group.15 Correct estimation of dispersion
not only allows accurate assessment of the quality of the data, but it
is also a prerequisite for the differential expression analysis.15 10 Here we estimate and assess the
dispersion using the estimateDisp function.
# Estimate the dispersion
counts_dge <- edgeR::estimateDisp(counts_dge, model_treat_cul)# Plot the dispersion against the mean
edgeR::plotMeanVar(counts_dge, show.raw.vars = TRUE, show.tagwise.vars = TRUE, NBline = TRUE,
show.binned.common.disp.vars = TRUE)Figure 4. Mean-variance relationship of all genes across various treatment conditions. The dispersion is estimated using the negative binomial model. A blue line indicates a trended line based on the negative binomial model, with pooled and estimated gene-wise dispersion shown in gray and light blue, respectively. A black line with a slope of 1 represents the expected variance of a Poisson distribution, which assumes that the variance is equal to the mean.
Figure 4. illustrates the mean-variance relationship of the genes in the dataset across all genes and samples.16 A trended line based on the negative binomial model shows an increasing variance correlating to higher mean expression, suggesting the existence of biological variation. The pooled gene-wise dispersion is similar to that of the estimated dispersion, and both of which align relatively closely with the trend line.
Based on the model design and the estimated dispersion above, two differential gene expression analysis are performed separately, comparing the transcriptomic changes involved after the addition of VTP to the cells under maximal strain with and without the presence of TGF-β1. Here, we employ the gene-wise Quasi-likelihood F test with the negative binomial generalized linear model to test for differential expression. We choose the method because it has been suggested to account for gene-specific variations due to biological and technical factors and that it also reflects the uncertainty in dispersion estimation.15 This is recommended for bulk RNA-seq dataset with a complex design (i.e., the two-factor model we employed in this analysis, as shown in Table 3).15 10
First, we examine the effect of VTP in comparison to control without
the presence of TGF-β1. Based on the model, we establish a contrast
comparing the treatment of VTP alone to the control. This is implemented
in the limma package.17
# Create a contrast for VTP vs. control
contrast_vtp <- limma::makeContrasts("treatmentsVTP - treatmentsCTRL", levels = model_treat_cul)We then perform the differential expression analysis by fitting the generalized linear model with the factors we designed and performing the Quasi-likelihood F test as described above.
# Fit the model
glm_fit_treat_cul <- edgeR::glmQLFit(counts_dge, model_treat_cul)
# Perform the differential expression analysis
qlf_vtp_ctrl <- edgeR::glmQLFTest(glmfit = glm_fit_treat_cul, contrast = contrast_vtp)# Extract the results
results_vtp_ctrl <- edgeR::topTags(qlf_vtp_ctrl, sort.by = "PValue", n = nrow(counts_dge))$table
# Save the differential expression results
write.csv(results_vtp_ctrl, file = "./Data/dge_vtp_ctrl.csv", row.names = TRUE)We will now examine the genes that are differentially expressed.
# Differentially expressed genes based on unadjusted p-value
sum(results_vtp_ctrl$PValue < 0.05)## [1] 3651
Among the 14814 genes in the dataset, 3651 are considered to be differentially expressed based on the unadjusted p-value with a threshold of 0.05. This is consistent with other RNA-seq studies that have a focus on heart gene expression.18 Additionally, a threshold of 0.05 is also a common choice, since it is a conservative threshold that does not post a high risk of false positives and is not too stringent that may mask other biologically relevant genes.
Since paring RNA-seq expression data involves multiple hypothesis testing, we also adjust the p-values using the Benjamini-Hochberg method. This method assumes that the power to detect differential expression is equally likely among all the tests, which is suitable in the case of RNA-seq data, where the assumption is that most genes are not differentially expressed.19
# Differentially expressed genes based on adjusted p-value
sum(results_vtp_ctrl$FDR < 0.05)## [1] 1040
Ater adjusting for multiple hypothesis testing, a total of 1040 genes are considered to be truly differentially expressed. Since it is known that VTP is a direct inhibitor of the TAP/TAZ complex, which is involved in multiple pathways such as cell proliferation and migration, we expect that a certain amount of genes should be differentially expressed.2 7
# Retrieve the set of differentially expressed genes
dge_vtp_ctrl <- results_vtp_ctrl[results_vtp_ctrl$FDR < 0.05, ]# Up-regulated genes
vtp_ctrl_up <- dge_vtp_ctrl[dge_vtp_ctrl$logFC > 0, ]
# We add gene symbol as a column so that it is searchable
vtp_ctrl_up$Gene = rownames(vtp_ctrl_up)
vtp_ctrl_up <- vtp_ctrl_up[, c(ncol(vtp_ctrl_up), 1:ncol(vtp_ctrl_up) - 1)]
# Sort by fold change and visualize with a table
vtp_ctrl_up <- vtp_ctrl_up[order(vtp_ctrl_up$logFC, decreasing = TRUE), ]
DT::datatable(vtp_ctrl_up, rownames = FALSE, filter = "top", options = list(pageLength = 5,
scrollX = TRUE)) %>%
# Changing the font size of content
DT::formatStyle(colnames(vtp_ctrl_up), fontSize = "12px") %>%
# Changing the font size of row names (sample names)
DT::formatStyle(0, fontSize = "12px")Table 4. Up-regulated genes under VTP treatment comparing to control without TGF-β1.
# Down-regulated genes
vtp_ctrl_down <- dge_vtp_ctrl[dge_vtp_ctrl$logFC < 0, ]
# We add gene symbol as a column so that it is searchable
vtp_ctrl_down$Gene = rownames(vtp_ctrl_down)
vtp_ctrl_down <- vtp_ctrl_down[, c(ncol(vtp_ctrl_down), 1:ncol(vtp_ctrl_down) - 1)]
# Sort by fold change and visualize with a table
vtp_ctrl_down <- vtp_ctrl_down[order(vtp_ctrl_down$logFC, decreasing = FALSE), ]
DT::datatable(vtp_ctrl_down, rownames = FALSE, filter = "top", options = list(pageLength = 5,
scrollX = TRUE)) %>%
# Changing the font size of content
DT::formatStyle(colnames(vtp_ctrl_down), fontSize = "12px") %>%
# Changing the font size of row names (sample names)
DT::formatStyle(0, fontSize = "12px")Table 5. Down-regulated genes under VTP treatment comparing to control without TGF-β1.
Interestingly, there we found 605 up-regulated genes, which is more
than the number of down-regulated genes, 435. The top upregulated genes
are VWA5B2, BCYRN1, UNC13A, SLC14A1, PDE4C, which are not
associated with fibrosis or cell proliferation. Rather, the top
downregulated gene is COL11A1, with a fold change of 18.11.
This gene encodes collagen
type XI alpha 1 chain. which directly involves in the formation of
collagen fibrils.20 We can see
that a total of 0 collagen genes are downregualed, with none of them
upregulated. This suggests that VTP reduces the expression of collagen
genes, which are the main components of fibrotic tissues. We also found
that P4HA1 gene is among the most significantly
downregualted genes. The gene enxcodes a key subunit of the prolyl
4-hydroxylase complex, which is involved in a key step of collagen
synthesis.21 Among the
upregulated genes, we found multiple SLC family genes,
which are involved in membrane transport and are likely to contribute to
cellular respirations, which are known to be reduced in fibrotic
tissues.22
Here we visualize the differentially expressed genes using a Volcano plot, which allows for easy visualization of the genes accoding to their fold change and significance.
# Label genes as up/down-regulated
results_vtp_ctrl <- results_vtp_ctrl %>%
dplyr::mutate(Expression = dplyr::case_when(logFC > 0 & FDR < 0.05 ~ "Up-regulated",
logFC < 0 & FDR < 0.05 ~ "Down-regulated", TRUE ~ "Not DE"))
# Extract a sublist of inportant genes to label in the plot Subset
# differentially expressed genes
results_deg <- results_vtp_ctrl[results_vtp_ctrl$FDR < 0.05, ]
# 1. Collagen genes
genes_to_label <- results_deg[grepl("COL", rownames(results_deg)), ]
# 2. Prolyl 4-hydroxylase subunit
genes_to_label <- rbind(genes_to_label, results_deg[grepl("P4HA1", rownames(results_deg)),
])
# 3. Top 3 up and down-regulated genes based on fold change
results_sort_by_FC <- results_deg[order(results_deg$logFC), ]
genes_to_label <- rbind(genes_to_label, results_sort_by_FC[1:3, ], results_sort_by_FC[(nrow(results_sort_by_FC) -
2):nrow(results_sort_by_FC), ])
# Remove any duplicated genes in the subset
genes_to_label <- genes_to_label[!duplicated(rownames(genes_to_label)), ]# We will plot all genes, with highlighted genes representing the
# differentially expressed genes Color genes based on whether it is
# significantly up/down-regulated Label the genes of interest
ggplot2::ggplot(results_vtp_ctrl, aes(logFC, -log(FDR, 10))) + ggplot2::geom_point(aes(color = Expression),
size = 2/5) + xlab(expression("log"[2] * "FC")) + ylab(expression("-log"[10] *
"FDR")) + ggplot2::scale_color_manual(values = c(`Up-regulated` = "firebrick3",
`Down-regulated` = "dodgerblue3", `Not DE` = "gray50")) + ggplot2::guides(color = ggplot2::guide_legend(override.aes = list(size = 1.5))) +
ggplot2::geom_label(data = genes_to_label, mapping = aes(logFC, -log(FDR, 10),
label = rownames(genes_to_label)), size = 2, color = "black", nudge_x = 0.1,
nudge_y = 0.1)Figure 5. Volcano plot of differentially expressed genes comparing VTP treatment to control without TGF-β1. Differentially expressed genes are coloured based on direction of change. Genes of interest are labelled.
To better visualize the differentially expressed genes between the two test condition, a heatmap is plotted to support the visualization of gene expression, and particularly the clustering of samples.
# Extract the differentially expressed genes from the heatmap matrix
heatmap_matrix_vtp_ctrl <- heatmap_matrix[rownames(dge_vtp_ctrl), grepl("\\.VTP|\\.CTRL",
colnames(heatmap_matrix))]
# Checking the range of the heatmap matrix
hmap_range <- range(heatmap_matrix_vtp_ctrl)
# Set the 3 color palette based on the scaled heatmap matrix
hmap_col <- circlize::colorRamp2(c(hmap_range[1], 0, hmap_range[2]), c("blue", "white",
"red"))
# Redefine 2 colours for the 4 treatment conditions
treatment_col <- hcl.colors(2, palette = "viridis")
names(treatment_col) <- c("VTP", "CTRL")
# Create an annotation dataframe
anno_df <- data.frame(treatment = sample_matrix_sorted[grepl("\\.VTP|\\.CTRL", rownames(sample_matrix_sorted)),
"treatment"], culture = sample_matrix_sorted[grepl("\\.VTP|\\.CTRL", rownames(sample_matrix_sorted)),
"culture"])
anno_col <- list(treatment = treatment_col, culture = culture_col)
# Create the heatmap annotation
hmap_anno <- ComplexHeatmap::HeatmapAnnotation(df = anno_df, col = anno_col)
# Plot the heatmap
hmap_vtp_ctrl <- ComplexHeatmap::Heatmap(heatmap_matrix_vtp_ctrl, show_row_dend = TRUE,
show_column_dend = TRUE, show_column_names = TRUE, show_row_names = FALSE, col = hmap_col,
top_annotation = hmap_anno)
hmap_vtp_ctrlFigure 6. Heatmap of differentially expressed genes comparing VTP treatment to control without TGF-β1. Samples are annotated based on treatment conditions and culture replicates.
Additionally, we want to investigate whether VTP can still negates the fibrotic effect under the potent profibrotic stimulus of TGF-β1. Therefore, we also establish a contrast comparing the treatment of VTP to control where both conditions are also subjected to the same TGF-β1 treatment.
# Create a contrast for VTP vs. control with TGF-β1
contrast_vtp_tgf <- limma::makeContrasts("treatmentsTGFb_VTP - treatmentsTGFb", levels = model_treat_cul)
# Perform differential expression analysis
results_vtp_tgf <- edgeR::glmQLFTest(glmfit = glm_fit_treat_cul, contrast = contrast_vtp_tgf)# Extract the results
results_vtp_tgf <- edgeR::topTags(results_vtp_tgf, sort.by = "PValue", n = nrow(counts_dge))$table
# Save the differential expression results
write.csv(results_vtp_ctrl, file = "./Data/dge_vtp_tgf.csv", row.names = TRUE)There are 14 differentially expressed genes comparing
VTP treatment to control in the presence of TGF-β1. Similar to the
previous analysis, we used an threshold of 0.05 for the
false discovery rate (FDR), as corrected by the Benjamini-Hochberg
method, to determine whether a gene is considered differentially
expressed. Among these differentially expressed genes, there are
3 genes that are up-regulated and 11 genes
that are down-regulated, as shown below.
# Retrieve the set of differentially expressed genes
dge_vtp_tgf <- results_vtp_tgf[results_vtp_tgf$FDR < 0.05, ]
# Up-regulated genes
vtp_tgf_up <- dge_vtp_tgf[dge_vtp_tgf$logFC > 0, ]
# We add gene symbol as a column so that it is searchable
vtp_tgf_up$Gene <- rownames(vtp_tgf_up)
vtp_tgf_up <- vtp_tgf_up[, c(ncol(vtp_tgf_up), 1:(ncol(vtp_tgf_up) - 1))]
# Sort the genes by fold change and visualize using a table
vtp_tgf_up <- vtp_tgf_up[order(vtp_tgf_up$logFC, decreasing = TRUE), ]
DT::datatable(vtp_tgf_up, rownames = FALSE, filter = "top", options = list(pageLength = 5,
scrollX = TRUE)) %>%
DT::formatStyle(colnames(vtp_ctrl_up), fontSize = "12px") %>%
DT::formatStyle(0, fontSize = "12px")Table 6. Up-regulated genes under VTP treatment in the presence of TGF-β1.
# Down-regulated genes
vtp_tgf_down <- dge_vtp_tgf[dge_vtp_tgf$logFC < 0, ]
# We add gene symbol as a column so that it is searchable
vtp_tgf_down$Gene <- rownames(vtp_tgf_down)
vtp_tgf_down <- vtp_tgf_down[, c(ncol(vtp_tgf_down), 1:(ncol(vtp_tgf_down) - 1))]
# Sort the genes by fold change and visualize using a table
vtp_tgf_down <- vtp_tgf_down[order(vtp_tgf_down$logFC), ]
DT::datatable(vtp_tgf_down, rownames = FALSE, filter = "top", options = list(pageLength = 5,
scrollX = TRUE)) %>%
DT::formatStyle(colnames(vtp_ctrl_down), fontSize = "12px") %>%
DT::formatStyle(0, fontSize = "12px")Table 7. Down-regulated genes under VTP treatment in the presence of TGF-β1.
Since there are only 14 differentially expressed genes
based on the corrected FDR threshold, the Volcano plot is will likely
show mostly noise. However, we will only annotate the top 3 genes for
both the up-regulated and down-regulated genes based on the fold
change.
# Label genes as up-regulated or down-regulated
results_vtp_tgf <- results_vtp_tgf %>%
dplyr::mutate(Expression = dplyr::case_when(logFC > 0 & FDR < 0.05 ~ "Up-regulated",
logFC < 0 & FDR < 0.05 ~ "Down-regulated", TRUE ~ "Not DE"))
# Label differentially expressed genes
genes_to_label <- results_vtp_tgf[results_vtp_tgf$Expression != "Not DE", ]
# Plot the volcano plot Color genes based on whether it is significantly
# up/down-regulated
ggplot2::ggplot(results_vtp_tgf, aes(logFC, -log(FDR, 10))) + ggplot2::geom_point(aes(color = Expression),
size = 2/5) + xlab(expression("log"[2] * "FC")) + ylab(expression("-log"[10] *
"FDR")) + ggplot2::scale_color_manual(values = c(`Up-regulated` = "firebrick3",
`Down-regulated` = "dodgerblue3", `Not DE` = "gray50")) + ggplot2::guides(color = ggplot2::guide_legend(override.aes = list(size = 1.5))) +
ggplot2::geom_label(data = genes_to_label, mapping = aes(logFC, -log(FDR, 10),
label = rownames(genes_to_label)), size = 2, color = "black", nudge_x = 0.1,
nudge_y = 0.1)Figure 7. Volcano plot of differentially expressed genes under VTP treatment in the presence of TGF-β1. Red and blue dots represent up-regulated and down-regulated genes, respectively. All differentially expressed genes are labeled.
# Extract the differentially expressed genes from the heatmap matrix
heatmap_matrix_vtp_tgf <- heatmap_matrix[rownames(dge_vtp_tgf), grepl("\\.TGFb_VTP|\\.TGFb",
colnames(heatmap_matrix))]
# Check the range of the heatmap matrix
hmap_range <- range(heatmap_matrix_vtp_tgf)
# Set the 3 color palette based on the scaled heatmap matrix
hmap_col <- circlize::colorRamp2(c(hmap_range[1], 0, hmap_range[2]), c("blue", "white",
"red"))
# Redefine 2 colours for the 4 treatment conditions
treatment_col <- hcl.colors(2, palette = "viridis")
names(treatment_col) <- c("TGFb_VTP", "TGFb")
# Create an annotation dataframe
anno_df <- data.frame(treatment = sample_matrix_sorted[grepl("\\.TGFb_VTP|\\.TGFb",
rownames(sample_matrix_sorted)), "treatment"], culture = sample_matrix_sorted[grepl("\\.TGFb_VTP|\\.TGFb",
rownames(sample_matrix_sorted)), "culture"])
anno_col <- list(treatment = treatment_col, culture = culture_col)
# Create the heatmap annotation
hmap_anno <- ComplexHeatmap::HeatmapAnnotation(df = anno_df, col = anno_col)
# Plot the heatmap
hmap_vtp_tgf <- ComplexHeatmap::Heatmap(heatmap_matrix_vtp_tgf, show_row_dend = TRUE,
show_column_dend = TRUE, show_column_names = TRUE, show_row_names = TRUE, col = hmap_col,
top_annotation = hmap_anno)
hmap_vtp_tgfFigure 8. Heatmap of differentially expressed genes under VTP treatment in the presence of TGF-β1.
How many genes were significantly differentially expressed? What thresholds did you use and why?
Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?
Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.
Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.
Which method did you choose and why?
What annotation data did you use and why? What version of the annotation are you using?
How many genesets were returned with what thresholds?
Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?
Do the over-representation results support conclusions or mechanism discussed in the original paper?
Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.
The link to the concurrent journal entry can be found here.